Leveraging Diverse Sources in Statistical Machine Translation
نویسنده
چکیده
Statistical machine translation is often faced with the problem of having insufficient training data for many language pairs. In this thesis, several methods have been proposed to leverage other sources to enhance the quality of machine translation systems. Particularly, we propose approaches suitable in these four scenarios: 1. when an additional parallel corpus between the source and the target language is available (ensemble decoding); 2. when parallel corpora between the source language and a third language and between that language and the target language are available (ensemble triangulation); 3. when an abundant source language monolingual corpus is available (graph propagation for paraphrasing out-of-vocabulary words); 4. when no additional resource is available (stacking). In the heart of these solutions lie two novel approaches: ensemble decoding and a graph propagation approach for paraphrasing out-of-vocabulary (oov) words. Ensemble decoding combines a number of translation systems dynamically at the decoding step. We evaluate its performance on a domain adaptation setting where a model trained on a large parliamentary domain is adapted to the medical domain, we then translate sentences from the medical domain. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation. We extend ensemble decoding to do triangulation on-the-fly when there exist parallel corpora between the source language and one or multiple pivot languages and between those and the target language. These triangulated systems are dynamically combined together
منابع مشابه
Large and Diverse Language Models for Statistical Machine Translation
This paper presents methods to combine large language models trained from diverse text sources and applies them to a state-ofart French–English and Arabic–English machine translation system. We show gains of over 2 BLEU points over a strong baseline by using continuous space language models in re-ranking.
متن کاملBilingually-Constrained Recursive Neural Networks with Syntactic Constraints for Hierarchical Translation Model
Hierarchical phrase-based translation models have advanced statistical machine translation (SMT). Because such models can improve leveraging of syntactic information, two types of methods (leveraging source parsing and leveraging shallow parsing) are applied to introduce syntactic constraints into translation models. In this paper, we propose a bilingually-constrained recursive neural network (...
متن کاملMixing Multiple Translation Models in Statistical Machine Translation
Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. We propose a novel approach, ensemble decoding, which combines a number of translation systems dynamically at the decoding step. In this paper, we evaluate performance on a domain adaptation se...
متن کاملA new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...
متن کاملMachine Translation with Real-Time Web Search
Contemporary machine translation systems usually rely on offline data retrieved from the web for individual model training, such as translation models and language models. In contrast to existing methods, we propose a novel approach that treats machine translation as a web search task and utilizes the web on the fly to acquire translation knowledge. This end-to-end approach takes advantage of f...
متن کامل